Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.
INTRODUCTION: This data set was used in the CoIL 2000 Challenge that contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.
The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5000 descriptions of customers, including the information of whether they have a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.
The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?
ANALYSIS: The baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data. It achieved an ROC score of 0.7159. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an ROC score of 0.5285, which was significant below the result from the training data.
CONCLUSION: For this iteration, the Random Forest algorithm achieved the leading ROC scores using the training and validation datasets. For this dataset, the Random Forest algorithm does not appear to be sufficiently adequate for production use. Further modeling and testing is recommended for the next step.
Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)
One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge
The project aims to touch on the following areas:
Any predictive modeling machine learning project genrally can be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(mailR)
library(parallel)
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
email_notify <- function(msg=""){
sender <- "luozhi2488@gmail.com"
receiver <- "dave@contactdavidlowe.com"
sbj_line <- "Notification from R Script"
password <- readLines("../email_credential.txt")
send.mail(
from = sender,
to = receiver,
subject= sbj_line,
body = msg,
smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE)
}
email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"
# Read the list of attribute names from a file
attrFile = "TicAttributes.txt"
conn <- file(attrFile, open="r")
lines <- readLines(conn)
close(conn)
colNames <- c()
for (i in 1:length(lines)) {
colNames <- c(colNames,word(lines[i]))
}
# Import the records for the training dataset
inputFile = "ticdata2000.txt"
xy_train <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = colNames)
# Standardize the class column to the name of targetVar
xy_train$targetVar <- "Yes"
xy_train$targetVar[xy_train$CARAVAN==0] <- "No"
xy_train$targetVar <- as.factor(xy_train$targetVar)
xy_train$targetVar <- relevel(xy_train$targetVar, "Yes")
xy_train$CARAVAN <- NULL
cat("Number of training rows and columns imported into xy_train:", nrow(xy_train), "by", ncol(xy_train), "\n")
## Number of training rows and columns imported into xy_train: 5822 by 86
# Import the records for the test/eval dataset without the target variable
noTargetCol <- colNames[-length(colNames)]
inputFile = "ticeval2000.txt"
x_test <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = noTargetCol)
cat("Number of training rows and columns imported into x_test:", nrow(x_test), "by", ncol(x_test), "\n")
## Number of training rows and columns imported into x_test: 4000 by 85
# Import the records for the test/eval dataset with only the target variable
inputFile = "tictgts2000.txt"
y_test <- read.csv(inputFile, header = FALSE, col.names = c("CARAVAN"))
y_test$targetVar <- "Yes"
y_test$targetVar[y_test$CARAVAN==0] <- "No"
y_test$targetVar <- as.factor(y_test$targetVar)
y_test$targetVar <- relevel(y_test$targetVar, "Yes")
y_test$CARAVAN <- NULL
cat("Number of training rows and columns imported into y_test:", nrow(y_test), "by", ncol(y_test), "\n")
## Number of training rows and columns imported into y_test: 4000 by 1
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(xy_train)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# if (targetCol <> 1) and (targetCol <> totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
# training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
# xy_train <- originalDataset[training_index,]
# xy_test <- originalDataset[-training_index,]
if (targetCol==1) {
x_train <- xy_train[,(targetCol+1):totCol]
y_train <- xy_train[,targetCol]
xy_test <- cbind(y_test, x_test)
y_test <- xy_test[,targetCol]
} else {
x_train <- xy_train[,1:(totAttr)]
y_train <- xy_train[,totCol]
xy_test <- cbind(x_test, y_test)
y_test <- xy_test[,targetCol]
}
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 5
if (totAttr%%dispCol == 0) {
dispRow <- totAttr%/%dispCol
} else {
dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 5 by 17
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1, classProbs=TRUE, summaryFunction=twoClassSummary)
metricTarget <- "ROC"
email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"
To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.
email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@546a03af}"
head(xy_train)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE
## 1 33 1 3 2 8 0 5 1 3
## 2 37 1 2 2 8 1 4 1 4
## 3 37 1 2 2 8 0 4 2 4
## 4 9 1 3 3 3 2 3 2 4
## 5 40 1 4 2 10 1 4 1 4
## 6 23 1 2 1 5 0 5 0 5
## MRELGE MRELSA MRELOV MFALLEEN MFGEKIND MFWEKIND MOPLHOOG MOPLMIDD
## 1 7 0 2 1 2 6 1 2
## 2 6 2 2 0 4 5 0 5
## 3 3 2 4 4 4 2 0 5
## 4 5 2 2 2 3 4 3 4
## 5 7 1 2 2 4 4 5 4
## 6 0 6 3 3 5 2 0 5
## MOPLLAAG MBERHOOG MBERZELF MBERBOER MBERMIDD MBERARBG MBERARBO MSKA
## 1 7 1 0 1 2 5 2 1
## 2 4 0 0 0 5 0 4 0
## 3 4 0 0 0 7 0 2 0
## 4 2 4 0 0 3 1 2 3
## 5 0 0 5 4 0 0 0 9
## 6 4 2 0 0 4 2 2 2
## MSKB1 MSKB2 MSKC MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS MZPART
## 1 1 2 6 1 1 8 8 0 1 8 1
## 2 2 3 5 0 2 7 7 1 2 6 3
## 3 5 0 4 0 7 2 7 0 2 9 0
## 4 2 1 4 0 5 4 9 0 0 7 2
## 5 0 0 0 0 4 5 6 2 1 5 4
## 6 2 2 4 2 9 0 5 3 3 9 0
## MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM MKOOPKLA PWAPART
## 1 0 4 5 0 0 4 3 0
## 2 2 0 5 2 0 5 4 2
## 3 4 5 0 0 0 3 4 2
## 4 1 5 3 0 0 4 4 0
## 5 0 0 9 0 0 6 3 0
## 6 5 2 3 0 0 3 3 0
## PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO PVRAAUT PAANHANG PTRACTOR
## 1 0 0 6 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 6 0 0 0 0 0
## 4 0 0 6 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 6 0 0 0 0 0
## PWERKT PBROM PLEVEN PPERSONG PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER
## 1 0 0 0 0 0 0 5 0 0
## 2 0 0 0 0 0 0 2 0 0
## 3 0 0 0 0 0 0 2 0 0
## 4 0 0 0 0 0 0 2 0 0
## 5 0 0 0 0 0 0 6 0 0
## 6 0 0 0 0 0 0 0 0 0
## PFIETS PINBOED PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 1 0 0 0 0 0 0 1 0 0
## 2 0 0 0 2 0 0 0 0 0
## 3 0 0 0 1 0 0 1 0 0
## 4 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 1 0 0
## AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG AGEZONG AWAOREG
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND targetVar
## 1 1 0 0 0 0 0 No
## 2 1 0 0 0 0 0 No
## 3 1 0 0 0 0 0 No
## 4 1 0 0 0 0 0 No
## 5 1 0 0 0 0 0 No
## 6 0 0 0 0 0 0 No
dim(xy_train)
## [1] 5822 86
sapply(xy_train, class)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MGODOV MGODGE MRELGE MRELSA MRELOV MFALLEEN MFGEKIND
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MFWEKIND MOPLHOOG MOPLMIDD MOPLLAAG MBERHOOG MBERZELF MBERBOER
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MBERMIDD MBERARBG MBERARBO MSKA MSKB1 MSKB2 MSKC
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MZPART MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MKOOPKLA PWAPART PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PVRAAUT PAANHANG PTRACTOR PWERKT PBROM PLEVEN PPERSONG
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER PFIETS PINBOED
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## AGEZONG AWAOREG ABRAND AZEILPL APLEZIER AFIETS AINBOED
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## ABYSTAND targetVar
## "integer" "factor"
summary(xy_train)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF
## Min. : 1.00 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.:10.00 1st Qu.: 1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :30.00 Median : 1.000 Median :3.000 Median :3.000
## Mean :24.25 Mean : 1.111 Mean :2.679 Mean :2.991
## 3rd Qu.:35.00 3rd Qu.: 1.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :41.00 Max. :10.000 Max. :5.000 Max. :6.000
## MOSHOOFD MGODRK MGODPR MGODOV
## Min. : 1.000 Min. :0.0000 Min. :0.000 Min. :0.00
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:0.00
## Median : 7.000 Median :0.0000 Median :5.000 Median :1.00
## Mean : 5.774 Mean :0.6965 Mean :4.627 Mean :1.07
## 3rd Qu.: 8.000 3rd Qu.:1.0000 3rd Qu.:6.000 3rd Qu.:2.00
## Max. :10.000 Max. :9.0000 Max. :9.000 Max. :5.00
## MGODGE MRELGE MRELSA MRELOV
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:5.000 1st Qu.:0.0000 1st Qu.:1.00
## Median :3.000 Median :6.000 Median :1.0000 Median :2.00
## Mean :3.259 Mean :6.183 Mean :0.8835 Mean :2.29
## 3rd Qu.:4.000 3rd Qu.:7.000 3rd Qu.:1.0000 3rd Qu.:3.00
## Max. :9.000 Max. :9.000 Max. :7.0000 Max. :9.00
## MFALLEEN MFGEKIND MFWEKIND MOPLHOOG
## Min. :0.000 Min. :0.00 Min. :0.0 Min. :0.000
## 1st Qu.:0.000 1st Qu.:2.00 1st Qu.:3.0 1st Qu.:0.000
## Median :2.000 Median :3.00 Median :4.0 Median :1.000
## Mean :1.888 Mean :3.23 Mean :4.3 Mean :1.461
## 3rd Qu.:3.000 3rd Qu.:4.00 3rd Qu.:6.0 3rd Qu.:2.000
## Max. :9.000 Max. :9.00 Max. :9.0 Max. :9.000
## MOPLMIDD MOPLLAAG MBERHOOG MBERZELF
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:0.000 1st Qu.:0.000
## Median :3.000 Median :5.000 Median :2.000 Median :0.000
## Mean :3.351 Mean :4.572 Mean :1.895 Mean :0.398
## 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :5.000
## MBERBOER MBERMIDD MBERARBG MBERARBO
## Min. :0.0000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:1.000
## Median :0.0000 Median :3.000 Median :2.00 Median :2.000
## Mean :0.5223 Mean :2.899 Mean :2.22 Mean :2.306
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :9.0000 Max. :9.000 Max. :9.00 Max. :9.000
## MSKA MSKB1 MSKB2 MSKC
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :1.000 Median :2.000 Median :2.000 Median :4.000
## Mean :1.621 Mean :1.607 Mean :2.203 Mean :3.759
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## MSKD MHHUUR MHKOOP MAUT1
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:5.00
## Median :1.000 Median :4.000 Median :5.000 Median :6.00
## Mean :1.067 Mean :4.237 Mean :4.772 Mean :6.04
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.00
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.00
## MAUT2 MAUT0 MZFONDS MZPART
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:5.000 1st Qu.:1.000
## Median :1.000 Median :2.000 Median :7.000 Median :2.000
## Mean :1.316 Mean :1.959 Mean :6.277 Mean :2.729
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:8.000 3rd Qu.:4.000
## Max. :7.000 Max. :9.000 Max. :9.000 Max. :9.000
## MINKM30 MINK3045 MINK4575 MINK7512
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :4.000 Median :3.000 Median :0.0000
## Mean :2.574 Mean :3.536 Mean :2.731 Mean :0.7961
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.0000
## MINK123M MINKGEM MKOOPKLA PWAPART
## Min. :0.0000 Min. :0.000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:0.0000
## Median :0.0000 Median :4.000 Median :4.000 Median :0.0000
## Mean :0.2027 Mean :3.784 Mean :4.236 Mean :0.7712
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:2.0000
## Max. :9.0000 Max. :9.000 Max. :8.000 Max. :3.0000
## PWABEDR PWALAND PPERSAUT PBESAUT
## Min. :0.00000 Min. :0.00000 Min. :0.00 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :5.00 Median :0.00000
## Mean :0.04002 Mean :0.07162 Mean :2.97 Mean :0.04827
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:6.00 3rd Qu.:0.00000
## Max. :6.00000 Max. :4.00000 Max. :8.00 Max. :7.00000
## PMOTSCO PVRAAUT PAANHANG PTRACTOR
## Min. :0.0000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.1754 Mean :0.009447 Mean :0.02096 Mean :0.09258
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.0000 Max. :9.000000 Max. :5.00000 Max. :6.00000
## PWERKT PBROM PLEVEN PPERSONG
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.000 Median :0.0000 Median :0.00000
## Mean :0.01305 Mean :0.215 Mean :0.1948 Mean :0.01374
## 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :6.00000 Max. :6.000 Max. :9.0000 Max. :6.00000
## PGEZONG PWAOREG PBRAND PZEILPL
## Min. :0.00000 Min. :0.00000 Min. :0.000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000000
## Median :0.00000 Median :0.00000 Median :2.000 Median :0.0000000
## Mean :0.01529 Mean :0.02353 Mean :1.828 Mean :0.0008588
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.:0.0000000
## Max. :3.00000 Max. :7.00000 Max. :8.000 Max. :3.0000000
## PPLEZIER PFIETS PINBOED PBYSTAND
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01889 Mean :0.02525 Mean :0.01563 Mean :0.04758
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :6.00000 Max. :1.00000 Max. :6.00000 Max. :5.00000
## AWAPART AWABEDR AWALAND APERSAUT
## Min. :0.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :0.403 Mean :0.01477 Mean :0.02061 Mean :0.5622
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :2.000 Max. :5.00000 Max. :1.00000 Max. :7.0000
## ABESAUT AMOTSCO AVRAAUT AAANHANG
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.01048 Mean :0.04105 Mean :0.002233 Mean :0.01254
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :4.00000 Max. :8.00000 Max. :3.000000 Max. :3.00000
## ATRACTOR AWERKT ABROM ALEVEN
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.03367 Mean :0.006183 Mean :0.07042 Mean :0.07661
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :4.00000 Max. :6.000000 Max. :2.00000 Max. :8.00000
## APERSONG AGEZONG AWAOREG ABRAND
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :1.0000
## Mean :0.005325 Mean :0.006527 Mean :0.004638 Mean :0.5701
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.000000 Max. :2.000000 Max. :7.0000
## AZEILPL APLEZIER AFIETS
## Min. :0.0000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.0000000 Median :0.000000 Median :0.00000
## Mean :0.0005153 Mean :0.006012 Mean :0.03178
## 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :2.000000 Max. :3.00000
## AINBOED ABYSTAND targetVar
## Min. :0.000000 Min. :0.00000 Yes: 348
## 1st Qu.:0.000000 1st Qu.:0.00000 No :5474
## Median :0.000000 Median :0.00000
## Mean :0.007901 Mean :0.01426
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :2.000000 Max. :2.00000
#entireDataset_x <- entireDataset[,1:(totCol-1)]
#entireDataset_y <- entireDataset[,totCol]
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## Yes 348 5.977327
## No 5474 94.022673
sapply(xy_train, function(x) sum(is.na(x)))
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR
## 0 0 0 0 0 0 0
## MGODOV MGODGE MRELGE MRELSA MRELOV MFALLEEN MFGEKIND
## 0 0 0 0 0 0 0
## MFWEKIND MOPLHOOG MOPLMIDD MOPLLAAG MBERHOOG MBERZELF MBERBOER
## 0 0 0 0 0 0 0
## MBERMIDD MBERARBG MBERARBO MSKA MSKB1 MSKB2 MSKC
## 0 0 0 0 0 0 0
## MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS
## 0 0 0 0 0 0 0
## MZPART MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM
## 0 0 0 0 0 0 0
## MKOOPKLA PWAPART PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO
## 0 0 0 0 0 0 0
## PVRAAUT PAANHANG PTRACTOR PWERKT PBROM PLEVEN PPERSONG
## 0 0 0 0 0 0 0
## PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER PFIETS PINBOED
## 0 0 0 0 0 0 0
## PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 0 0 0 0 0 0 0
## AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG
## 0 0 0 0 0 0 0
## AGEZONG AWAOREG ABRAND AZEILPL APLEZIER AFIETS AINBOED
## 0 0 0 0 0 0 0
## ABYSTAND targetVar
## 0 0
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
boxplot(x_train[,i], main=names(x_train)[i])
}
# Histograms each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
hist(x_train[,i], main=names(x_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
plot(density(x_train[,i]), main=names(x_train)[i])
}
# Scatterplot matrix colored by class
# pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
featurePlot(x=x_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
Some dataset may require additional preparation activities that will best exposes the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:
email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23223dd8}"
# Accodring to the data dictionary, columns MOSTYPE and MOSHOOFD should be converted to categorical type
xy_train$MOSTYPE <- as.factor(xy_train$MOSTYPE)
xy_train$MOSHOOFD <- as.factor(xy_train$MOSHOOFD)
xy_test$MOSTYPE <- as.factor(xy_test$MOSTYPE)
xy_test$MOSHOOFD <- as.factor(xy_test$MOSHOOFD)
# Not applicable for this iteration of the project.
# Not applicable for this iteration of the project.
dim(xy_train)
## [1] 5822 86
sapply(xy_train, class)
## MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR
## "factor" "integer" "integer" "integer" "factor" "integer" "integer"
## MGODOV MGODGE MRELGE MRELSA MRELOV MFALLEEN MFGEKIND
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MFWEKIND MOPLHOOG MOPLMIDD MOPLLAAG MBERHOOG MBERZELF MBERBOER
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MBERMIDD MBERARBG MBERARBO MSKA MSKB1 MSKB2 MSKC
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MZPART MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## MKOOPKLA PWAPART PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PVRAAUT PAANHANG PTRACTOR PWERKT PBROM PLEVEN PPERSONG
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER PFIETS PINBOED
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## AGEZONG AWAOREG ABRAND AZEILPL APLEZIER AFIETS AINBOED
## "integer" "integer" "integer" "integer" "integer" "integer" "integer"
## ABYSTAND targetVar
## "integer" "factor"
email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5a01ccaa}"
proc.time()-startTimeScript
## user system elapsed
## 57.782 0.879 68.196
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training. The typical evaluation tasks include:
For this project, we will evaluate one linear, three non-linear, and three ensemble algorithms:
Linear Algorithm: Logistic Regression
Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine
Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting
The random number seed is reset before each run to ensure that the evaluation of each algorithm is performed using the same data splits. It ensures the results are directly comparable.
# Logistic Regression (Classification)
email_notify(paste("Linear Regression modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1fbc7afb}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(fit.glm)
## Generalized Linear Model
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7207725 0.01142857 0.9965292
proc.time()-startTimeModule
## user system elapsed
## 18.403 0.197 18.846
email_notify(paste("Linear Regression modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@c818063}"
# Decision Tree - CART (Regression/Classification)
email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2c8d66b2}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.000862069 0.7137003 0.05731092 0.9873927
## 0.002431477 0.6905597 0.02588235 0.9917786
## 0.003831418 0.6020277 0.01428571 0.9967093
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.000862069.
proc.time()-startTimeModule
## user system elapsed
## 7.197 0.010 7.298
email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2752f6e2}"
# k-Nearest Neighbors (Regression/Classification)
email_notify(paste("k-Nearest Neighbors modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1d251891}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.6077055 0.008571429 0.9936031
## 7 0.6152044 0.000000000 0.9981732
## 9 0.6275954 0.000000000 0.9998175
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
## user system elapsed
## 243.999 0.034 246.657
email_notify(paste("k-Nearest Neighbors modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6e8dacdf}"
# Support Vector Machine (Regression/Classification)
email_notify(paste("Support Vector Machine modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7f63425a}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.6385720 0.002857143 0.9990869
## 0.50 0.6389426 0.002857143 0.9990869
## 1.00 0.6394141 0.002857143 0.9989041
##
## Tuning parameter 'sigma' was held constant at a value of 0.00555605
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.00555605 and C = 1.
proc.time()-startTimeModule
## user system elapsed
## 210.238 1.994 214.576
email_notify(paste("Support Vector Machine modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59a6e353}"
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results:
##
## ROC Sens Spec
## 0.6790771 0.07487395 0.9738721
proc.time()-startTimeModule
## user system elapsed
## 85.109 0.605 86.663
email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6aaa5eb0}"
# Random Forest (Regression/Classification)
email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@380fb434}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.6854831 0.00000000 1.0000000
## 66 0.7145401 0.06058824 0.9795374
## 131 0.7211056 0.06630252 0.9755181
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 131.
proc.time()-startTimeModule
## user system elapsed
## 1336.521 2.029 1353.086
email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@42d3bd8b}"
# Stochastic Gradient Boosting (Regression/Classification)
email_notify(paste("Stochastic Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4e04a765}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.7650455 0.000000000 0.9994526
## 1 100 0.7677287 0.002857143 0.9987223
## 1 150 0.7681374 0.002857143 0.9985395
## 2 50 0.7647211 0.002857143 0.9985391
## 2 100 0.7726834 0.005714286 0.9972601
## 2 150 0.7735426 0.014285714 0.9967123
## 3 50 0.7728907 0.005714286 0.9981728
## 3 100 0.7744829 0.020084034 0.9956161
## 3 150 0.7668234 0.028739496 0.9936055
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
## user system elapsed
## 84.865 0.168 85.933
email_notify(paste("Stochastic Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1e67b872}"
results <- resamples(list(LR=fit.glm, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LR, CART, kNN, SVM, BagCART, RF, GBM
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.5424654 0.7187908 0.7346121 0.7207725 0.7505980 0.7771116 0
## CART 0.6335597 0.6848133 0.7312430 0.7137003 0.7401341 0.7581856 0
## kNN 0.5788965 0.6083592 0.6166638 0.6275954 0.6564493 0.6819796 0
## SVM 0.5814390 0.6095456 0.6461774 0.6394141 0.6642008 0.6870177 0
## BagCART 0.6374771 0.6631625 0.6735072 0.6790771 0.6888940 0.7432742 0
## RF 0.6798642 0.7136872 0.7252546 0.7211056 0.7376468 0.7489312 0
## GBM 0.7005746 0.7625714 0.7835808 0.7744829 0.7969225 0.8023981 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## LR 0.00000000 0.00000000 0.00000000 0.011428571 0.02857143 0.02857143
## CART 0.00000000 0.05714286 0.05714286 0.057310924 0.07899160 0.08571429
## kNN 0.00000000 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000
## SVM 0.00000000 0.00000000 0.00000000 0.002857143 0.00000000 0.02857143
## BagCART 0.02857143 0.05714286 0.08571429 0.074873950 0.08760504 0.11764706
## RF 0.02857143 0.05714286 0.05714286 0.066302521 0.08571429 0.11764706
## GBM 0.00000000 0.00000000 0.00000000 0.020084034 0.02920168 0.08571429
## NA's
## LR 0
## CART 0
## kNN 0
## SVM 0
## BagCART 0
## RF 0
## GBM 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.9926874 0.9949726 0.9963504 0.9965292 0.9981718 1.0000000 0
## CART 0.9744059 0.9840037 0.9872146 0.9873927 0.9890511 1.0000000 0
## kNN 0.9981752 1.0000000 1.0000000 0.9998175 1.0000000 1.0000000 0
## SVM 0.9963437 0.9981727 1.0000000 0.9989041 1.0000000 1.0000000 0
## BagCART 0.9634369 0.9693784 0.9744059 0.9738721 0.9781022 0.9817518 0
## RF 0.9670932 0.9712066 0.9771698 0.9755181 0.9781022 0.9835766 0
## GBM 0.9908592 0.9931544 0.9963437 0.9956161 0.9977165 1.0000000 0
dotplot(results)
cat('The average ROC from all models is:', mean(c(results$values$`LR~ROC`, results$values$`CART~ROC`, results$values$`kNN~ROC`, results$values$`SVM~ROC`, results$values$`BagCART~ROC`, results$values$`RF~ROC`, results$values$`GBM~ROC`)))
## The average ROC from all models is: 0.6965926
After we have arrived at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve their performance further.
Using the two best-performing algorithms from the previous section, we will search each algorithm's parameter space for a combination of parameters that yields the best results.
Finally, we will tune the best-performing algorithm further and see whether we can get more accuracy out of it.
# Tuning algorithm #1 - Decision Tree (CART)
email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@77b52d12}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(cp = c(0.0001, 0.0005, 0.001, 0.005, 0.01))
fit.final1 <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)
print(fit.final1)
## CART
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 1e-04 0.7034841 0.065966387 0.9853831
## 5e-04 0.7051201 0.065966387 0.9857487
## 1e-03 0.7137003 0.057310924 0.9873927
## 5e-03 0.5205876 0.002857143 0.9994516
## 1e-02 0.5000000 0.000000000 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
proc.time()-startTimeModule
## user system elapsed
## 5.703 0.035 5.802
email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4157f54e}"
# Tuning algorithm #2 - Random Forest (RF)
email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6615435c}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(5, 10, 20, 35, 50))
fit.final2 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)
print(fit.final2)
## Random Forest
##
## 5822 samples
## 85 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 5 0.7027004 0.005798319 0.9978076
## 10 0.7100881 0.028823529 0.9908636
## 20 0.7154303 0.037563025 0.9870271
## 35 0.7153575 0.057731092 0.9828244
## 50 0.7159008 0.057731092 0.9809985
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 50.
proc.time()-startTimeModule
## user system elapsed
## 1620.398 2.866 1641.105
email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7225790e}"
results <- resamples(list(CART=fit.final1, RF=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, RF
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.6335597 0.6848133 0.7312430 0.7137003 0.7401341 0.7581856 0
## RF 0.6791329 0.6999891 0.7189867 0.7159008 0.7301532 0.7521116 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0 0.05714286 0.05714286 0.05731092 0.07899160 0.08571429 0
## RF 0 0.03571429 0.05714286 0.05773109 0.07857143 0.11764706 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.9744059 0.9840037 0.9872146 0.9873927 0.9890511 1.0000000 0
## RF 0.9689214 0.9789762 0.9817351 0.9809985 0.9849453 0.9872263 0
dotplot(results)
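One observation from the Sens columns above: both models detect very few of the 'Yes' cases because the positive class is rare. A hedged sketch (not part of the original run) of one common mitigation is to let caret down-sample the majority class inside each resampling fold via the `sampling` argument of `trainControl`. The `control_ds` object below assumes the original `control` used 10-fold cross-validation with `twoClassSummary`, consistent with the ROC-based output shown earlier.

```r
# Sketch: down-sample the majority ('No') class within each CV fold.
# Assumes xy_train, targetVar, metricTarget, and seedNum exist as above.
library(caret)
control_ds <- trainControl(method = "repeatedcv", number = 10, repeats = 1,
                           classProbs = TRUE, summaryFunction = twoClassSummary,
                           sampling = "down")
set.seed(seedNum)
fit.rf.down <- train(targetVar~., data = xy_train, method = "rf",
                     metric = metricTarget, tuneGrid = expand.grid(mtry = 50),
                     trControl = control_ds)
```

Down-sampling typically trades some specificity for a large gain in sensitivity, which may matter more than raw accuracy for a marketing use case like this one.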
Once we have narrowed the field down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model involves sub-tasks such as validating the model with the hold-out test dataset, creating the final model with the training data, and saving the model to disk for later use.
email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@52af6cff}"
predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 17 53
## No 221 3709
##
## Accuracy : 0.9315
## 95% CI : (0.9232, 0.9391)
## No Information Rate : 0.9405
## P-Value [Acc > NIR] : 0.9917
##
## Kappa : 0.0857
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07143
## Specificity : 0.98591
## Pos Pred Value : 0.24286
## Neg Pred Value : 0.94377
## Prevalence : 0.05950
## Detection Rate : 0.00425
## Detection Prevalence : 0.01750
## Balanced Accuracy : 0.52867
##
## 'Positive' Class : Yes
##
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)
auc <- performance(pred, measure = "auc")
cat('The area under the curve (AUC) value is:', auc@y.values[[1]])
## The area under the curve (AUC) value is: 0.5286702
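Note that the AUC above is computed from hard class labels, which reduces the ROC curve to a single operating point. A sketch of an alternative (assuming `fit.final2`, `xy_test`, and `y_test` from above, and that the training run enabled class probabilities, as the ROC metric implies): score the test set with predicted probabilities so ROCR can sweep the classification threshold.

```r
# Sketch: probability-based ROC/AUC with ROCR. 'Yes' is the positive class,
# so label.ordering puts it last (ROCR treats the second level as positive).
probabilities <- predict(fit.final2, newdata = xy_test, type = "prob")
pred_prob <- prediction(probabilities$Yes, y_test,
                        label.ordering = c("No", "Yes"))
auc_prob <- performance(pred_prob, measure = "auc")
auc_prob@y.values[[1]]
```

A probability-based AUC is usually noticeably higher than the label-based figure and is the more standard way to report ROC performance.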
startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(seedNum)
# Combining the training and test datasets to form the original dataset that will be used for training the final model
# xy_train <- rbind(xy_train, xy_test)
finalModel <- randomForest(targetVar~., data=xy_train, mtry=50)
print(finalModel)
##
## Call:
## randomForest(formula = targetVar ~ ., data = xy_train, mtry = 50)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 50
##
## OOB estimate of error rate: 7.71%
## Confusion matrix:
## Yes No class.error
## Yes 24 324 0.93103448
## No 125 5349 0.02283522
proc.time()-startTimeModule
## user system elapsed
## 33.522 0.113 34.004
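The OOB confusion matrix above shows a class error of roughly 93% on 'Yes'. A hedged sketch of one option (an assumption, not part of the original workflow) is a stratified, balanced bootstrap using `randomForest`'s `strata` and `sampsize` arguments, so each tree sees equal numbers of 'Yes' and 'No' cases.

```r
# Sketch: balanced bootstrap sampling per tree. Assumes xy_train, targetVar,
# and seedNum exist as above; nYes is 348 per the OOB matrix (24 + 324).
set.seed(seedNum)
nYes <- sum(xy_train$targetVar == "Yes")
balancedModel <- randomForest(targetVar~., data = xy_train, mtry = 50,
                              strata = xy_train$targetVar,
                              sampsize = c(nYes, nYes))
print(balancedModel)
```

Balancing the per-tree samples generally lowers the 'Yes' class error substantially at the cost of more false positives, which is often an acceptable trade-off when the goal is identifying prospective buyers.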
#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3dd3bcd}"
proc.time()-startTimeScript
## user system elapsed
## 3710.458 9.149 3801.231